
    Performance Models for Data Transfers: A Case Study with Molecular Chemistry Kernels

    With the increasing complexity of hardware, systems with different memory nodes are ubiquitous in High Performance Computing (HPC). It is paramount to develop strategies that overlap data transfers between memory nodes with computations in order to exploit the full potential of these systems. In this article, we consider the problem of deciding the order of data transfers between two memory nodes for a set of independent tasks, with the objective of minimizing the makespan. We prove that, with limited memory capacity, obtaining the optimal order of data transfers is an NP-complete problem. We propose several heuristics for this problem and detail the situations in which each is favorable. We present an analysis of our heuristics on traces obtained by running two molecular chemistry kernels, Hartree-Fock (HF) and Coupled Cluster Single Double (CCSD), on 10 nodes of an HPC system. Our results show that some of our heuristics achieve significant overlap for moderate memory capacities and come very close to the lower bound on the makespan.
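    The ordering problem described above lends itself to simple list-scheduling rules. Below is a minimal Python sketch of one hypothetical greedy rule (not one of the paper's heuristics): transfer next the data of the task with the longest compute time, subject to the memory capacity, so that the remaining transfers can hide behind execution. The cost model and all names are illustrative assumptions.

```python
# Illustrative greedy ordering of transfers for independent tasks.
# Hypothetical model, not the heuristics evaluated in the paper: each task needs
# one input buffer of size 'mem'; a transfer may start only if the buffer fits
# in the destination memory 'capacity'; compute overlaps with later transfers.

def greedy_transfer_order(tasks, capacity, bandwidth):
    """tasks: list of dicts {'id', 'mem', 'compute'}; returns (order, makespan)."""
    # Prioritize tasks with the longest compute time so their transfers start
    # early and the remaining transfers hide behind their execution.
    pending = sorted(tasks, key=lambda t: t['compute'], reverse=True)
    order = []
    transfer_done = 0.0   # time the transfer engine becomes free
    compute_done = 0.0    # time the compute engine becomes free
    in_use = 0.0          # memory currently held by unfinished tasks
    running = []          # (finish_time, mem) of running tasks, to release memory

    for t in pending:
        running.sort()
        # If the buffer does not fit, wait for earlier tasks to finish and free memory.
        while running and in_use + t['mem'] > capacity:
            finish, m = running.pop(0)
            transfer_done = max(transfer_done, finish)
            in_use -= m
        in_use += t['mem']
        transfer_done += t['mem'] / bandwidth      # transfers are serialized
        start = max(transfer_done, compute_done)   # a task waits for its data
        compute_done = start + t['compute']
        running.append((compute_done, t['mem']))
        order.append(t['id'])

    return order, compute_done

tasks = [{'id': 'A', 'mem': 4.0, 'compute': 10.0},
         {'id': 'B', 'mem': 2.0, 'compute': 3.0},
         {'id': 'C', 'mem': 3.0, 'compute': 6.0}]
print(greedy_transfer_order(tasks, capacity=6.0, bandwidth=1.0))
```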

    Improving Performance of Iterative Methods by Lossy Checkpointing

    Iterative methods are commonly used to solve large, sparse linear systems, which are fundamental operations in many modern scientific simulations. When large-scale iterative methods run with a large number of ranks in parallel, they have to checkpoint their dynamic variables periodically to cope with unavoidable fail-stop errors, requiring fast I/O systems and large storage space. To this end, significantly reducing the checkpointing overhead is critical to improving the overall performance of iterative methods. Our contribution is fourfold. (1) We propose a novel lossy checkpointing scheme that can significantly improve the checkpointing performance of iterative methods by leveraging lossy compressors. (2) We formulate a lossy checkpointing performance model and theoretically derive an upper bound on the extra number of iterations caused by the distortion of data in lossy checkpoints, in order to guarantee a performance improvement under the lossy checkpointing scheme. (3) We analyze the impact of lossy checkpointing (i.e., the extra iterations caused by lossy checkpoint files) for multiple types of iterative methods. (4) We evaluate the lossy checkpointing scheme with optimal checkpointing intervals on a high-performance computing environment with 2,048 cores, using the well-known scientific computation package PETSc and a state-of-the-art checkpoint/restart toolkit. Experiments show that our optimized lossy checkpointing scheme can significantly reduce the fault tolerance overhead of iterative methods, by 23-70% compared with traditional checkpointing and by 20-58% compared with lossless-compressed checkpointing, in the presence of system failures.
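    As context for the trade-off between checkpoint cost and extra iterations, the sketch below uses the classic Young/Daly interval (a standard approximation, not the performance model derived in the paper) to compare an expensive lossless checkpoint against a cheaper lossy one that costs a few extra iterations after each restart. All numbers are assumptions.

```python
# Back-of-the-envelope comparison of checkpointing schemes for an iterative solver.
# Illustrative only; the cost model and numbers are assumptions, not the
# performance model or experimental results from the paper.

from math import sqrt

def young_daly_interval(ckpt_cost, mtbf):
    """Classic Young/Daly approximation of the optimal checkpoint interval (seconds)."""
    return sqrt(2.0 * ckpt_cost * mtbf)

def overhead_fraction(ckpt_cost, mtbf, extra_work_per_restart=0.0):
    """Approximate fault-tolerance overhead: periodic checkpoint writes plus the
    expected re-computation (and, for lossy checkpoints, extra iterations) per failure."""
    interval = young_daly_interval(ckpt_cost, mtbf)
    ckpt_overhead = ckpt_cost / interval                         # time spent writing checkpoints
    rework = (interval / 2.0 + extra_work_per_restart) / mtbf    # expected loss per failure
    return ckpt_overhead + rework

mtbf = 6 * 3600.0  # assumed system mean time between failures: 6 hours
print(overhead_fraction(ckpt_cost=120.0, mtbf=mtbf))                                  # lossless: 120 s per checkpoint
print(overhead_fraction(ckpt_cost=15.0, mtbf=mtbf, extra_work_per_restart=30.0))      # lossy: cheaper writes, extra iterations
```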

    Scheduling Data Flow Program in XKaapi: A New Affinity-Based Algorithm for Heterogeneous Architectures

    Efficient implementations of parallel applications on heterogeneous hybrid architectures require a careful balance between computations and communications with accelerator devices. Even if most of the communication time can be overlapped by computations, it is essential to reduce the total volume of communicated data. The literature therefore abounds with ad hoc methods to reach that balance, but they are architecture and application dependent. We propose here a generic mechanism to automatically optimize the scheduling between CPUs and GPUs, and compare two strategies within this mechanism: the classical Heterogeneous Earliest Finish Time (HEFT) algorithm and our new, parametrized Distributed Affinity Dual Approximation algorithm (DADA), which groups tasks by affinity before running a fast dual approximation. We ran experiments on a heterogeneous parallel machine with six CPU cores and eight NVIDIA Fermi GPUs, porting three standard dense linear algebra kernels from the PLASMA library on top of the XKaapi runtime, and report their performance. HEFT and DADA both perform well under various experimental conditions, but DADA performs better for larger problems and numbers of GPUs and, in most cases, generates far less data transfer than HEFT to achieve the same performance.
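    To make the comparison concrete, here is a minimal HEFT-style assignment rule in Python: rank tasks, then place each one on the worker that gives the earliest finish time. This is a simplified illustration with independent tasks and no transfer costs, not the XKaapi implementation of HEFT or DADA, and the cost table is invented.

```python
# Minimal HEFT-style assignment for independent tasks on heterogeneous workers
# (e.g., CPU cores and GPUs). Simplified illustration: no task graph, no data
# transfer costs; not the scheduler implemented in XKaapi.

def heft_assign(task_costs, workers):
    """task_costs: {task: {worker: predicted_time}}; workers: list of worker ids.
    Returns a mapping task -> worker chosen to minimize the earliest finish time."""
    ready = {w: 0.0 for w in workers}   # time each worker becomes available
    placement = {}
    # Process the most expensive tasks first (a stand-in for HEFT's upward rank).
    for task in sorted(task_costs, key=lambda t: -max(task_costs[t].values())):
        finish = {w: ready[w] + task_costs[task][w] for w in workers}
        best = min(finish, key=finish.get)   # earliest finish time wins
        placement[task] = best
        ready[best] = finish[best]
    return placement

# Hypothetical costs: the GPU is much faster for large GEMMs, slower for tiny tasks.
costs = {
    'gemm0': {'cpu': 8.0, 'gpu': 1.0},
    'gemm1': {'cpu': 8.0, 'gpu': 1.0},
    'small': {'cpu': 0.5, 'gpu': 0.8},
}
print(heft_assign(costs, ['cpu', 'gpu']))
```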

    SWIFT: Using task-based parallelism, fully asynchronous communication, and graph partition-based domain decomposition for strong scaling on more than 100,000 cores

    We present a new open-source cosmological code, called SWIFT, designed to solve the equations of hydrodynamics using a particle-based approach (Smoothed Particle Hydrodynamics) on hybrid shared/distributed-memory architectures. SWIFT was designed from the bottom up to provide excellent strong scaling on both commodity clusters (Tier-2 systems) and Top-100 supercomputers (Tier-0 systems), without relying on architecture-specific features or specialized accelerator hardware. This performance is due to three main computational approaches:
    • Task-based parallelism for shared-memory parallelism, which provides fine-grained load balancing and thus strong scaling on large numbers of cores.
    • Graph-based domain decomposition, which uses the task graph to decompose the simulation domain such that the work, rather than just the data (as in most partitioning schemes), is equally distributed across all nodes.
    • Fully dynamic and asynchronous communication, in which communication is modelled as just another task in the task-based scheme, sending data whenever it is ready and deferring tasks that rely on data from other nodes until it arrives (see the sketch after this list).
    In order to use these approaches, the code had to be rewritten from scratch and the algorithms therein adapted to the task-based paradigm. As a result, we show upwards of 60% parallel efficiency for moderate-sized problems when increasing the number of cores 512-fold, on both x86-based and Power8-based architectures.
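    The third point, treating communication as just another task, can be illustrated with a tiny Python sketch: a receive is submitted to the same pool as compute work, independent work overlaps with it, and dependent work is deferred until the data has arrived. This mimics only the scheduling idea; it is not SWIFT's task engine, and all names and timings are invented.

```python
# Toy illustration of "communication is just another task": a receive is enqueued
# like any compute task, and dependent tasks run only once it completes.
# Not SWIFT's implementation; names and delays are invented for illustration.

from concurrent.futures import ThreadPoolExecutor
import time

def recv_halo(neighbour):
    time.sleep(0.1)                      # stand-in for an asynchronous receive
    return f"halo data from rank {neighbour}"

def density_task(n_particles, halo):
    return f"density computed on {n_particles} particles using [{halo}]"

with ThreadPoolExecutor() as pool:
    halo_future = pool.submit(recv_halo, neighbour=3)             # communication task
    local_future = pool.submit(lambda: "local-only work done")    # overlaps with the receive
    # The dependent task is only submitted once its input data has arrived,
    # just as a task scheduler would defer it on an unmet dependency.
    density = pool.submit(density_task, 10_000, halo_future.result())
    print(local_future.result())
    print(density.result())
```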

    Large non-Gaussian Halo Bias from Single Field Inflation

    We calculate Large Scale Structure observables for non-Gaussianity arising from non-Bunch-Davies initial states in single field inflation. These scenarios can have substantial primordial non-Gaussianity from squeezed (but observable) momentum configurations. They generate a term in the halo bias that may be more strongly scale-dependent than the contribution from the local ansatz. We also discuss theoretical considerations required to generate an observable signature.
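    For reference, the scale-dependent halo bias correction from the standard local ansatz, the baseline against which the abstract's more strongly scale-dependent term is compared, takes the familiar form below. This is the standard literature result, not the non-Bunch-Davies expression derived in the paper.

```latex
% Standard local-ansatz scale-dependent bias (the baseline mentioned above),
% with b_1 the Gaussian bias, \delta_c the collapse threshold, T(k) the transfer
% function and D(z) the growth factor. Not the result derived in this paper.
\Delta b(k) \simeq 3 f_{\mathrm{NL}}\,(b_1 - 1)\,\delta_c\,
\frac{\Omega_m H_0^2}{c^2\, k^2\, T(k)\, D(z)}
```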

    Loop Quantum Gravity and the Planck Regime of Cosmology

    The very early universe provides the best arena we currently have to test quantum gravity theories. The success of the inflationary paradigm in accounting for the observed inhomogeneities in the cosmic microwave background already illustrates this point to a certain extent, because the paradigm is based on quantum field theory on curved cosmological space-times. However, this analysis excludes the Planck era, because the background space-time satisfies Einstein's equations all the way back to the big bang singularity. Using techniques from loop quantum gravity, the paradigm has now been extended to a self-consistent theory from the Planck regime to the onset of inflation, covering some 11 orders of magnitude in curvature. In addition, for a narrow window of initial conditions, there are departures from the standard paradigm, with novel effects such as a modification of the consistency relation involving the scalar and tensor power spectra and a new source of non-Gaussianities. Thus, the genesis of the large scale structure of the universe can be traced back to quantum gravity fluctuations in the Planck regime. This report provides a bird's-eye view of these developments for the general relativity community.
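    For context, the consistency relation referred to above is, in standard single-field slow-roll inflation, the relation between the tensor-to-scalar ratio and the tensor tilt; the unmodified form (not the loop-quantum-cosmology-corrected version discussed in the paper) reads:

```latex
% Standard single-field slow-roll consistency relation; the paper reports a
% modification of this relation in the loop-quantum-gravity extension.
r \equiv \frac{\mathcal{P}_t(k)}{\mathcal{P}_s(k)} \simeq -8\, n_t
```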

    Global Time Distribution via Satellite-Based Sources of Entangled Photons

    We propose a satellite-based scheme to perform clock synchronization between ground stations spread across the globe using quantum resources. We refer to this as a quantum clock synchronization (QCS) network. Through detailed numerical simulations, we assess the feasibility and capabilities of a near-term implementation of this scheme. We consider a small constellation of nanosatellites equipped with only modest resources, including quantum devices such as spontaneous parametric down-conversion (SPDC) sources, avalanche photo-detectors (APDs), and moderately stable on-board clocks such as chip-scale atomic clocks (CSACs). In our simulations, the performance parameters describing the hardware have been chosen such that they are either already commercially available or require only moderate advances. We conclude that such a scheme could feasibly establish a global network of ground-based clocks synchronized to sub-nanosecond (down to a few picoseconds) precision. Such QCS satellite constellations would form the infrastructure for a future quantum network, able to serve as a globally accessible entanglement resource. At the same time, our clock synchronization protocol provides the sub-nanosecond synchronization required for many quantum networking protocols and can thus be seen as adding an extra layer of utility to quantum technologies in the space domain designed for other purposes.
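    The basic measurement behind entanglement-based clock synchronization can be sketched with a toy calculation: the offset between two station clocks is where the histogram of photon arrival-time differences peaks. The Python script below is an illustrative toy with invented offset, jitter, and rates; it is not the simulation framework used in the paper.

```python
# Toy estimate of the offset between two ground-station clocks from the arrival
# times of photon pairs: the offset is where the histogram of arrival-time
# differences peaks. Parameters are assumptions, not values from the paper.

import numpy as np

rng = np.random.default_rng(0)
true_offset = 3.2e-9                                    # assumed 3.2 ns clock offset
pair_times = np.sort(rng.uniform(0.0, 1e-3, 20_000))    # pair creation times over 1 ms
jitter = 50e-12                                         # assumed 50 ps detector jitter

t_a = np.sort(pair_times + rng.normal(0, jitter, pair_times.size))
t_b = np.sort(pair_times + true_offset + rng.normal(0, jitter, pair_times.size))

# For every detection at station A, collect time differences to station-B
# detections within a +/-20 ns window, then take the peak of their histogram.
window, bin_w = 20e-9, 10e-12
lo = np.searchsorted(t_b, t_a - window)
hi = np.searchsorted(t_b, t_a + window)
diffs = np.concatenate([t_b[lo[i]:hi[i]] - ta for i, ta in enumerate(t_a)])

hist, edges = np.histogram(diffs, bins=np.arange(-window, window, bin_w))
peak = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])
print(f"recovered clock offset ~ {peak * 1e9:.2f} ns (true: {true_offset * 1e9:.2f} ns)")
```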

    Nanoscale piezoelectric response across a single antiparallel ferroelectric domain wall

    A surprising asymmetry in the local electromechanical response across a single antiparallel ferroelectric domain wall is reported. Piezoelectric force microscopy is used to investigate both the in-plane and out-of-plane electromechanical signals around domain walls in congruent and near-stoichiometric lithium niobate. The observed asymmetry is shown to correlate strongly with crystal stoichiometry, suggesting defect-domain wall interactions, and a defect-dipole model is proposed. A finite element method is used to simulate the electromechanical processes at the wall and to reconstruct the images. For the near-stoichiometric composition, good agreement is found in both form and magnitude. Some discrepancy remains between the experimental and modelled widths of the imaged effects across a wall; this is analyzed from the perspective of possible electrostatic contributions to the imaging process, as well as local changes in the material properties in the vicinity of the wall.

    Remarks on the renormalization of primordial cosmological perturbations

    We briefly review the need to perform renormalization of inflationary perturbations to properly work out the physical power spectra. We also summarize the basis of (momentum-space) renormalization in curved spacetime and address several misconceptions found in recent literature on this subject.